{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5c74b242",
   "metadata": {
    "colab_type": "text",
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/tomasonjo/blogs/blob/master/subgraph_filtering/Subgraph%20filtering.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "LBXYBWeBIHHS",
   "metadata": {
    "id": "LBXYBWeBIHHS"
   },
   "source": [
    "* Updated to GDS 2.0 version\n",
    "* Link to original blog post: https://towardsdatascience.com/subgraph-filtering-in-neo4j-graph-data-science-library-f0676d8d6134"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "tjfgQRwQIObk",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "tjfgQRwQIObk",
    "outputId": "c173abb3-3a23-4978-9c8c-7c4cdd252c7f"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collecting neo4j\n",
      "  Downloading neo4j-4.4.2.tar.gz (89 kB)\n",
      "\u001b[?25l\r",
      "\u001b[K     |███▋                            | 10 kB 24.9 MB/s eta 0:00:01\r",
      "\u001b[K     |███████▎                        | 20 kB 14.0 MB/s eta 0:00:01\r",
      "\u001b[K     |███████████                     | 30 kB 10.5 MB/s eta 0:00:01\r",
      "\u001b[K     |██████████████▋                 | 40 kB 9.2 MB/s eta 0:00:01\r",
      "\u001b[K     |██████████████████▎             | 51 kB 3.9 MB/s eta 0:00:01\r",
      "\u001b[K     |██████████████████████          | 61 kB 4.6 MB/s eta 0:00:01\r",
      "\u001b[K     |█████████████████████████▋      | 71 kB 5.2 MB/s eta 0:00:01\r",
      "\u001b[K     |█████████████████████████████▎  | 81 kB 5.9 MB/s eta 0:00:01\r",
      "\u001b[K     |████████████████████████████████| 89 kB 3.1 MB/s \n",
      "\u001b[?25hRequirement already satisfied: pytz in /usr/local/lib/python3.7/dist-packages (from neo4j) (2022.1)\n",
      "Building wheels for collected packages: neo4j\n",
      "  Building wheel for neo4j (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
      "  Created wheel for neo4j: filename=neo4j-4.4.2-py3-none-any.whl size=115365 sha256=f0ec223c3b5f884016c256349d73444daf76c85be90afb25ed5f96cb74176fff\n",
      "  Stored in directory: /root/.cache/pip/wheels/10/d6/28/95029d7f69690dbc3b93e4933197357987de34fbd44b50a0e4\n",
      "Successfully built neo4j\n",
      "Installing collected packages: neo4j\n",
      "Successfully installed neo4j-4.4.2\n"
     ]
    }
   ],
   "source": [
    "!pip install neo4j"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "8e2e129a",
   "metadata": {
    "id": "8e2e129a"
   },
   "outputs": [],
   "source": [
    "# Define Neo4j connections\n",
    "import pandas as pd\n",
    "from neo4j import GraphDatabase\n",
    "host = 'bolt://3.231.25.240:7687'\n",
    "user = 'neo4j'\n",
    "password = 'hatchets-visitor-axes'\n",
    "driver = GraphDatabase.driver(host,auth=(user, password))\n",
    "\n",
    "def run_query(query, params={}):\n",
    "    with driver.session() as session:\n",
    "        result = session.run(query, params)\n",
    "        return pd.DataFrame([r.values() for r in result], columns=result.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a449fbd",
   "metadata": {
    "id": "6a449fbd"
   },
   "source": [
    "It has been a while since I wrote a post about new features in the Neo4j Graph Data Science library (GDS). For those of you that never heard of the GDS library, it features more than 50 graph algorithms ranging from community detection to node embedding algorithms and more. In this blog post, I will present Subgraph filtering, one of the library's newer features.\n",
    "\n",
    "Neo4j Graph Data Science library uses the Graph Loader component to project an in-memory graph. The in-memory project graph is separate from the stored graph in the Neo4j database. The GDS library then uses the in-memory graph projection, optimized for topology and property lookup operations, to execute graph algorithms. You can use either Native or Cypher projections to project an in-memory graph. In addition, subgraph filtering allows you to create a new projected in-memory graph based on an existing projected graph. For example, you could project a graph, identify the weakly connected components within that network, and then use subgraph filtering to create a new projected graph that consists only of the largest component in the network. This allows you a smoother graph data science workflow, where you don't have to store intermediate results back to the database and then use Graph Loader to project a new in-memory graph.\n",
    "# Graph model\n",
    "In this blog post, we will be using the Harry Potter network dataset I have created in one of my previous blog posts. It consists of interactions between characters in the Harry Potter and the Philosopher's Stone book.\n",
    "The graph schema is relatively simple. It consists of characters and their interactions. We know the name of the characters and when they first appeared in the book (firstSeen). The INTERACTS relationship holds the information about how many times two characters have interacted (weight) and when they first interacted (firstSeen)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "9e0e6a95",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 81
    },
    "id": "9e0e6a95",
    "outputId": "33da8c4c-1210-452e-8158-346fb2f5630b"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>result</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>imported characters</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                result\n",
       "0  imported characters"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "run_query(\"\"\"\n",
    "LOAD CSV WITH HEADERS FROM \"https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/HP/character_first_seen.csv\" as row\n",
    "MERGE (c:Character{name:row.name})\n",
    "SET c.firstSeen = toInteger(row.value)\n",
    "RETURN distinct 'imported characters' as result\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "95de86a1",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 81
    },
    "id": "95de86a1",
    "outputId": "1422a9e9-f695-4f26-d5dc-b740d726a31e"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>result</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>imported relationships</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   result\n",
       "0  imported relationships"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "run_query(\"\"\"\n",
    "LOAD CSV WITH HEADERS FROM \"https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/HP/HP_rels.csv\" as row\n",
    "MATCH (s:Character{name:row.source})\n",
    "MATCH (t:Character{name:row.target})\n",
    "MERGE (s)-[i:INTERACTS]->(t)\n",
    "SET i.weight = toInteger(row.weight),\n",
    "    i.firstSeen = toInteger(row.first_seen)\n",
    "RETURN distinct 'imported relationships' as result\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a08f9236",
   "metadata": {
    "id": "a08f9236"
   },
   "source": [
    "# Subgraph filtering\n",
    "As mention, the goal of this blog post is to demonstrate the power of subgraph filtering. We will not delve into specific algorithms and how they work. We will begin by projecting an in-memory graph with Native projections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "64b2fd7c",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 81
    },
    "id": "64b2fd7c",
    "outputId": "c502fa6e-0b24-4fc0-f10d-f67fcfc23516"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>nodeProjection</th>\n",
       "      <th>relationshipProjection</th>\n",
       "      <th>graphName</th>\n",
       "      <th>nodeCount</th>\n",
       "      <th>relationshipCount</th>\n",
       "      <th>projectMillis</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>{'Character': {'label': 'Character', 'properti...</td>\n",
       "      <td>{'INTERACTS': {'orientation': 'UNDIRECTED', 'i...</td>\n",
       "      <td>interactions</td>\n",
       "      <td>120</td>\n",
       "      <td>806</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                      nodeProjection  \\\n",
       "0  {'Character': {'label': 'Character', 'properti...   \n",
       "\n",
       "                              relationshipProjection     graphName  nodeCount  \\\n",
       "0  {'INTERACTS': {'orientation': 'UNDIRECTED', 'i...  interactions        120   \n",
       "\n",
       "   relationshipCount  projectMillis  \n",
       "0                806             40  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "run_query(\"\"\"\n",
    "CALL gds.graph.project('interactions',\n",
    "  'Character',\n",
    "  {INTERACTS : {orientation:'UNDIRECTED'}},\n",
    "  {nodeProperties:['firstSeen'], \n",
    "   relationshipProperties: ['firstSeen', 'weight']})\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bef61084",
   "metadata": {
    "id": "bef61084"
   },
   "source": [
    "We have projected an in-memory graph under the \"interactions\" name. The projected graph includes all Character nodes and their firstSeen properties. We have also defined that we want to project the INTERACTS relationships as undirected and include both firstSeen and weight properties.\n",
    "\n",
    "Now that we have the projected named graph, we can go ahead and execute any of the graph algorithms on it. Here, I have chosen to run the Weakly Connected Components algorithm (WCC). The WCC algorithm is used to identify disconnected parts of your network that are also known as components."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "b57ca4e4",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 81
    },
    "id": "b57ca4e4",
    "outputId": "57c86e13-6672-4c3a-87d2-c6eaae9e74f1"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>componentCount</th>\n",
       "      <th>componentDistribution</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4</td>\n",
       "      <td>{'p99': 110, 'min': 1, 'max': 110, 'mean': 30....</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   componentCount                              componentDistribution\n",
       "0               4  {'p99': 110, 'min': 1, 'max': 110, 'mean': 30...."
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "run_query(\"\"\"\n",
    "CALL gds.wcc.stats('interactions')\n",
    "YIELD componentCount, componentDistribution\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "804ab4df",
   "metadata": {
    "id": "804ab4df"
   },
   "source": [
    "You can use the stats mode of the algorithm when you are only interested in the high-level overview of the results and have no wish to store the results back to Neo4j or to the projected graph. We can observe that there are four components in our network, the largest having 110 members. Now, we will begin with subgraph filtering. The syntax for the subgraph filtering procedure is as follows:\n",
    "\n",
    "```\n",
    "CALL gds.beta.graph.project.subgraph(\n",
    "  graphName: String, -> name of the new projected graph\n",
    "  fromGraphName: String, -> name of the existing projected graph\n",
    "  nodeFilter: String, -> predicate used to filter nodes\n",
    "  relationshipFilter: String -> predicate used to filter relationships\n",
    ")\n",
    "```\n",
    "You can use the nodeFilter parameter to filter nodes based on node properties or labels. Similarly, you can use relationshipFilter parameter to filter relationships based on their properties and types. There is only a single node label and relationships type in our HP network, so we will only focus on filtering by properties.\n",
    "We will begin by using the subgraph filtering to create a new projected in-memory graph that holds only relationships that have the weight property greater than 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "8cc99dd9",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 81
    },
    "id": "8cc99dd9",
    "outputId": "aa672058-f0f4-4f34-82dd-d2b3e12e0617"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fromGraphName</th>\n",
       "      <th>nodeFilter</th>\n",
       "      <th>relationshipFilter</th>\n",
       "      <th>graphName</th>\n",
       "      <th>nodeCount</th>\n",
       "      <th>relationshipCount</th>\n",
       "      <th>projectMillis</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>interactions</td>\n",
       "      <td>*</td>\n",
       "      <td>r.weight &gt; 1.0</td>\n",
       "      <td>wgt1</td>\n",
       "      <td>120</td>\n",
       "      <td>474</td>\n",
       "      <td>214</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  fromGraphName nodeFilter relationshipFilter graphName  nodeCount  \\\n",
       "0  interactions          *     r.weight > 1.0      wgt1        120   \n",
       "\n",
       "   relationshipCount  projectMillis  \n",
       "0                474            214  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "run_query(\"\"\"\n",
    "CALL gds.beta.graph.project.subgraph(\n",
    "  'wgt1', // name of the new projected graph\n",
    "  'interactions', // name of the existing projected graph\n",
    "  '*', // node predicate filter\n",
    "  'r.weight > 1.0' // relationship predicate filter\n",
    ")\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2fd3ed84",
   "metadata": {
    "id": "2fd3ed84"
   },
   "source": [
    "The wildcard operator `*` is used to define that we don't want to apply any filtering. In this case, we have only filtered relationships, but kept all the nodes. The predicate syntax is similar to Cypher query. The relationship entity is always identified by `r` and the node entity is identified with variable `n`.\n",
    "\n",
    "We can go ahead and run the WCC algorithm on the new in-memory graph that we created with the subgraph filtering. It is available under the wgt1name, for a lack of better name nomenclature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "b76e6744",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 81
    },
    "id": "b76e6744",
    "outputId": "c89d719f-2f05-4c54-c1b9-59b9bfecc731"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>componentCount</th>\n",
       "      <th>componentDistribution</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>43</td>\n",
       "      <td>{'p99': 78, 'min': 1, 'max': 78, 'mean': 2.790...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   componentCount                              componentDistribution\n",
       "0              43  {'p99': 78, 'min': 1, 'max': 78, 'mean': 2.790..."
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "run_query(\"\"\"\n",
    "CALL gds.wcc.mutate('wgt1', {mutateProperty:'wcc'})\n",
    "YIELD componentCount, componentDistribution\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92253ee6",
   "metadata": {
    "id": "92253ee6"
   },
   "source": [
    "The filtered projected graph has 43 components. This is reasonable as we ignored all the relationships that have the weight property equal or smaller to one, but left all the nodes. We have used the mutate mode of the algorithm to write the results back to the projected in-memory graph.\n",
    "Let's say we now want to run Eigenvector centrality only on the largest component of the graph. First, we need to identify the ID of the largest component. The results of the WCC algorithm are stored in the projected in-memory graph, so we need to use the `gds.graph.streamNodeProperties` procedure to access the WCC results and identify the largest component."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "c3e69b3d",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "c3e69b3d",
    "outputId": "6475bbba-a2bf-40ca-abb3-fb875b64cc3a"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>component</th>\n",
       "      <th>componentSize</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>78</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>14</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>16</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   component  componentSize\n",
       "0          0             78\n",
       "1          5              1\n",
       "2         14              1\n",
       "3         16              1\n",
       "4          4              1"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "run_query(\"\"\"\n",
    "CALL gds.graph.nodeProperty.stream('wgt1', 'wcc') \n",
    "YIELD propertyValue\n",
    "RETURN propertyValue as component, count(*) as componentSize\n",
    "ORDER BY componentSize DESC \n",
    "LIMIT 5\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b1d0e67",
   "metadata": {
    "id": "4b1d0e67"
   },
   "source": [
    "As we saw before, the largest component has 78 members and its id is 0. We can use the subgraph filtering feature to create a new projected in-memory graph that contains only the nodes in the largest component."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "a01f9b8d",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 81
    },
    "id": "a01f9b8d",
    "outputId": "5e46b7b2-db7b-4e5c-8fc3-d5674699cdb6"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fromGraphName</th>\n",
       "      <th>nodeFilter</th>\n",
       "      <th>relationshipFilter</th>\n",
       "      <th>graphName</th>\n",
       "      <th>nodeCount</th>\n",
       "      <th>relationshipCount</th>\n",
       "      <th>projectMillis</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>wgt1</td>\n",
       "      <td>n.wcc = 0</td>\n",
       "      <td>*</td>\n",
       "      <td>largest_community</td>\n",
       "      <td>78</td>\n",
       "      <td>474</td>\n",
       "      <td>285</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  fromGraphName nodeFilter relationshipFilter          graphName  nodeCount  \\\n",
       "0          wgt1  n.wcc = 0                  *  largest_community         78   \n",
       "\n",
       "   relationshipCount  projectMillis  \n",
       "0                474            285  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "run_query(\"\"\"\n",
    "CALL gds.beta.graph.project.subgraph(\n",
    "  'largest_community', \n",
    "  'wgt1',\n",
    "  'n.wcc = 0', \n",
    "  '*'\n",
    ")\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7820a73c",
   "metadata": {
    "id": "7820a73c"
   },
   "source": [
    "Now, we can go ahead and accomplish our task by running the Eigenvector centrality on the largest component only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "8a38468e",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "8a38468e",
    "outputId": "14ca9da8-9033-460b-c011-f1eb4b21b6dd"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>character</th>\n",
       "      <th>score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Harry Potter</td>\n",
       "      <td>0.415443</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Ronald Weasley</td>\n",
       "      <td>0.296247</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Rubeus Hagrid</td>\n",
       "      <td>0.291302</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Hermione Granger</td>\n",
       "      <td>0.272300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Severus Snape</td>\n",
       "      <td>0.239990</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          character     score\n",
       "0      Harry Potter  0.415443\n",
       "1    Ronald Weasley  0.296247\n",
       "2     Rubeus Hagrid  0.291302\n",
       "3  Hermione Granger  0.272300\n",
       "4     Severus Snape  0.239990"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "run_query(\"\"\"\n",
    "CALL gds.eigenvector.stream('largest_community')\n",
    "YIELD nodeId, score\n",
    "RETURN gds.util.asNode(nodeId).name as character, score\n",
    "ORDER BY score DESC\n",
    "LIMIT 5\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95cbe209",
   "metadata": {
    "id": "95cbe209"
   },
   "source": [
    "In the last example, I will show how you can combine multiple node and relationship predicates using the `AND` or `OR` logical operators. First we will run the degree centrality on the interactions network and store the results to the projected graph using the mutate mode."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "96ae8a92",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 81
    },
    "id": "96ae8a92",
    "outputId": "36b2f67a-4107-41ad-9142-c0371853f032"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>nodePropertiesWritten</th>\n",
       "      <th>centralityDistribution</th>\n",
       "      <th>mutateMillis</th>\n",
       "      <th>postProcessingMillis</th>\n",
       "      <th>preProcessingMillis</th>\n",
       "      <th>computeMillis</th>\n",
       "      <th>configuration</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>120</td>\n",
       "      <td>{'p99': 41.00023651123047, 'min': 0.0, 'max': ...</td>\n",
       "      <td>0</td>\n",
       "      <td>200</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>{'jobId': 'a0feb736-da3c-499f-abdf-2521135e1a5...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   nodePropertiesWritten                             centralityDistribution  \\\n",
       "0                    120  {'p99': 41.00023651123047, 'min': 0.0, 'max': ...   \n",
       "\n",
       "   mutateMillis  postProcessingMillis  preProcessingMillis  computeMillis  \\\n",
       "0             0                   200                    0              0   \n",
       "\n",
       "                                       configuration  \n",
       "0  {'jobId': 'a0feb736-da3c-499f-abdf-2521135e1a5...  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "run_query(\"\"\"\n",
    "CALL gds.degree.mutate('interactions',\n",
    "  {mutateProperty:'degree'})\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e060bb8e",
   "metadata": {
    "id": "e060bb8e"
   },
   "source": [
    "Now we can go ahead and filter the subgraph by using multiple node and relationship predicates:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "9287b823",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 81
    },
    "id": "9287b823",
    "outputId": "5ffa5221-f458-466a-85d8-df5b4f58d4e8"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fromGraphName</th>\n",
       "      <th>nodeFilter</th>\n",
       "      <th>relationshipFilter</th>\n",
       "      <th>graphName</th>\n",
       "      <th>nodeCount</th>\n",
       "      <th>relationshipCount</th>\n",
       "      <th>projectMillis</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>interactions</td>\n",
       "      <td>n.firstSeen &lt; 35583 AND n.degree &gt; 2.0</td>\n",
       "      <td>r.weight &gt; 5.0</td>\n",
       "      <td>first_half</td>\n",
       "      <td>40</td>\n",
       "      <td>82</td>\n",
       "      <td>88</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  fromGraphName                              nodeFilter relationshipFilter  \\\n",
       "0  interactions  n.firstSeen < 35583 AND n.degree > 2.0     r.weight > 5.0   \n",
       "\n",
       "    graphName  nodeCount  relationshipCount  projectMillis  \n",
       "0  first_half         40                 82             88  "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "run_query(\"\"\"\n",
    "CALL gds.beta.graph.project.subgraph(\n",
    "  'first_half', // new projected graph\n",
    "  'interactions', // existing projected graph\n",
    "  'n.firstSeen < 35583 AND n.degree > 2.0', // node predicates\n",
    "  'r.weight > 5.0' // relationship predicates\n",
    ")\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dcff01e2",
   "metadata": {
    "id": "dcff01e2"
   },
   "source": [
    "In the node predicate, we have selected nodes with a degree value greater than two and the first seen property smaller than 35583. For the relationship predicate, I have chosen only relationships that have a weight greater than five. We can run any of the graph algorithms on the newly filtered in-memory subgraph:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "09ebadcd",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "09ebadcd",
    "outputId": "ebdffecf-8596-4d65-b980-5bdc7177567e"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>character</th>\n",
       "      <th>score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Harry Potter</td>\n",
       "      <td>0.544203</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Ronald Weasley</td>\n",
       "      <td>0.368860</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Rubeus Hagrid</td>\n",
       "      <td>0.318522</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Albus Dumbledore</td>\n",
       "      <td>0.270781</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Quirinus Quirrell</td>\n",
       "      <td>0.243933</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           character     score\n",
       "0       Harry Potter  0.544203\n",
       "1     Ronald Weasley  0.368860\n",
       "2      Rubeus Hagrid  0.318522\n",
       "3   Albus Dumbledore  0.270781\n",
       "4  Quirinus Quirrell  0.243933"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "run_query(\"\"\"\n",
    "CALL gds.eigenvector.stream('first_half')\n",
    "YIELD nodeId, score\n",
    "RETURN gds.util.asNode(nodeId).name as character, score\n",
    "ORDER BY score\n",
    "DESC LIMIT 5\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "beac9999",
   "metadata": {
    "id": "beac9999"
   },
   "source": [
    "Lastly, when you are done with your graph analysis, you can use the following Cypher query to drop all the projected in-memory graphs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "a1ffc507",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 175
    },
    "id": "a1ffc507",
    "outputId": "33588ffe-3676-437d-9e07-530238ba70ce"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>'dropped ' + graphName</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>dropped interactions</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>dropped wgt1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>dropped first_half</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>dropped got</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>dropped role2vec</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>dropped lp-graph</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>dropped stock</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>dropped largest_community</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>dropped starwars_cypher</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>dropped nomad</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      'dropped ' + graphName\n",
       "0       dropped interactions\n",
       "1               dropped wgt1\n",
       "2         dropped first_half\n",
       "3                dropped got\n",
       "4           dropped role2vec\n",
       "5           dropped lp-graph\n",
       "6              dropped stock\n",
       "7  dropped largest_community\n",
       "8    dropped starwars_cypher\n",
       "9              dropped nomad"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "run_query(\"\"\"\n",
    "CALL gds.graph.list() YIELD graphName\n",
    "CALL gds.graph.drop(graphName)\n",
    "YIELD database\n",
    "RETURN 'dropped ' + graphName\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "175a800a",
   "metadata": {
    "id": "175a800a"
   },
   "source": [
    "# Conclusion\n",
    "Subgraph filtering is a nice addition to the GDS library that allows smoother workflows. Instead of having to store the algoritm results back to Neo4j and use Native or Cypher projections to create a new in-memory graph, you can use subgraph filtering to filter an existing in-memory graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2386ba0",
   "metadata": {
    "id": "c2386ba0"
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "colab": {
   "include_colab_link": true,
   "name": "Subgraph filtering.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
